First, load the necessary libraries. In addition to the rpack package, we use the tidyverse (in particular dplyr for data manipulation and ggplot2 for plotting).

#install.packages("lcmix", repos=c("http://R-Forge.R-project.org",
#                                   "http://cran.at.r-project.org"),dependencies=TRUE)
library(rpack)
library(tidyverse)
library(LaplacesDemon)
library(Matrix)
library(plotly)
library(Gmedian)
library(lcmix) # Added by Markku. Package can be found at R-Forge
library(parallel) # For parallel computing. rpack uses function "detectCores".
library(flexmix)

External functions

These external functions were added by Markku; they generate data from a gamma mixture. Point weights are determined either 1) by chance or 2) based on the distance from the distribution mean: the greater the Euclidean distance between a point and the population mean, the larger the weight. “simulate_unif_grid” is a sub-function of the main function “simulate_gamma_mixture”.

source("CodeCollection/simulate_gamma_mixture.R")
source("CodeCollection/simulate_unif_grid.R")
source("CodeCollection/utility_functions.R")

# Get the colors for plotting
c_col <- get_cols() 
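The distance-based weighting (option 2 above) can be sketched in a few lines. This is a minimal illustration only, assuming weights are rescaled monotonically in the distance; the actual scaling inside `simulate_gamma_mixture` may differ.

```r
# Sketch of distance-based weighting (assumption: the real scaling in
# simulate_gamma_mixture may differ). Weight grows with the Euclidean
# distance from the sample mean.
set.seed(1)
pts <- matrix(rgamma(200, shape = 2, rate = 1), ncol = 2)  # 100 points in 2-D
ctr <- colMeans(pts)                               # sample mean of the cloud
d   <- sqrt(rowSums(sweep(pts, 2, ctr)^2))         # Euclidean distances to mean
w   <- 1 + (d - min(d)) / (max(d) - min(d))        # rescale weights into [1, 2]
```

Points far from the centre thus receive weights near 2, points at the centre weights near 1.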

Simple example

For testing purposes, we first generate a small and simple data set to test the flexmix package.

# Simple data set
m1 <- 0
m2 <- 50
sd1 <- sd2 <- 5
N1 <- 100
N2 <- 10

a <- rnorm(n=N1, mean=m1, sd=sd1)
b <- rnorm(n=N2, mean=m2, sd=sd2)

x <- c(a,b)
class <- c(rep('a', N1), rep('b', N2))
data <- data.frame(x = as.numeric(x), class = as.factor(class)) # note: wrapping in cbind() would coerce the factor to numeric codes

Plot the data, then fit a mixture of two Gaussians.

library("ggplot2")
p <- ggplot(data, aes(x = x)) + 
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", fill = "white") +
  geom_vline(xintercept = m1, col = "red", size = 2) + 
  geom_vline(xintercept = m2, col = "blue", size = 2)
p

set.seed(0)

mo1 <- FLXMRglm(family = "gaussian")
mo2 <- FLXMRglm(family = "gaussian")
flexfit <- flexmix(x ~ 1, data = data, k = 2, model = list(mo1, mo2))

# So how did we do? The class assignments are recovered perfectly
# (up to the arbitrary ordering of the cluster labels)

print(table(clusters(flexfit), data$class))
##    
##       a   b
##   1   0  10
##   2 100   0
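Note that the cluster numbering is arbitrary (label switching): cluster 1 captured class b and cluster 2 class a. A quick way to score such a table is to match each cluster to its majority class; the hard-coded matrix below simply mirrors the counts printed above.

```r
# Confusion matrix copied from the output above (rows = clusters, cols = classes).
conf <- matrix(c(0, 100, 10, 0), nrow = 2,
               dimnames = list(cluster = c("1", "2"), class = c("a", "b")))
# Majority-class matching: each cluster scores its largest column count.
accuracy <- sum(apply(conf, 1, max)) / sum(conf)
accuracy  # 1: every point is assigned consistently with its true class
```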

Lastly, let’s visualize the fitted model along with the data.

c1 <- parameters(flexfit, component=1)[[1]]
c2 <- parameters(flexfit, component=2)[[1]]

#' Source: http://tinyheero.github.io/2015/10/13/mixture-model.html
#' Plot a Mixture Component
#' 
#' @param x Input data
#' @param mu Mean of component
#' @param sigma Standard deviation of component
#' @param lam Mixture weight of component
plot_mix_comps <- function(x, mu, sigma, lam) {
  lam * dnorm(x, mu, sigma)
}

lam <- table(clusters(flexfit))
  
ggplot(data) +
geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", fill = "white") +
stat_function(geom = "line", fun = plot_mix_comps,
                args = list(c1[1], c1[2], lam[1]/sum(lam)),
                colour = "red", lwd = 1.5) +
stat_function(geom = "line", fun = plot_mix_comps,
                args = list(c2[1], c2[2], lam[2]/sum(lam)),
                colour = "blue", lwd = 1.5) +
ylab("Density")

All good here!

Simulated data

Let’s set up some data to be clustered. To control cluster overlap, the clusters are placed on a grid.

Outliers are sampled uniformly over the cluster grid.

Some of the simulated clusters have heavily weighted points at the cluster edges (50/50 chance per cluster).

# Initialize seed number:
#seed = Sys.time()
#seed = as.integer(seed)
#seed = seed %% 100000
seed = 10376
set.seed(seed)
k = 10 # ten clusters
n = c(20, 40, 60, 80, 50, 80, 60, 40, 20, 50) # uneven cluster sizes
n_out = 20 # number of outliers

# Generating 500 points from mixture of 10 gamma distributions.
test_dat = simulate_gamma_mixture(
  n,
  k,
  n_out = n_out,
  out_scale = 5,
  scale_between_range = c(0, 1),
  outgroup_alpha = 0.4,
  place_on_grid = TRUE,
  overlap_scale = 0.5
)
true_mu <- test_dat$mu_true
test_dat <- test_dat$Y

# Ids in interactive plot
id <-  1:nrow(test_dat)

plot_sim <-
  ggplot(data = test_dat, aes(
    x = x,
    y = y,
    size = w,
    label = id
  )) +
  geom_point() +
  scale_size(range = c(2, 6)) +  # Scale objects sizes
  guides(color = guide_legend(# Point size in legend
    override.aes = list(size = 5))) +
  labs(x = "x", y = "y", title = "Unclustered data") +
  theme(
    legend.position = "right",
    # Legend position and removing ticks from axis
    axis.text.x = ggplot2::element_blank(),
    axis.text.y = ggplot2::element_blank(),
    axis.ticks = ggplot2::element_blank()
  )

ggplotly(plot_sim, tooltip = c("id", "w"))

Fit a normal mixture model to the data we just simulated.

formula <- cbind(test_dat$x, test_dat$y) ~ 1
set.seed(1)
test_flexfit <- flexmix(
  formula = formula,
  k = k,
  model = FLXMCmvnorm(diag = TRUE))

print(table(clusters(test_flexfit), test_dat$orig_group))
##    
##      1  2  3  4  5  6  7  8  9 10 11
##   1  0  0  0  0  0 55  0  0  0  0  2
##   2  1 20  0  0 45  1 60  0  8  2  1
##   3 19  0  0  0  0 22  0  0  0  0  0
##   4  0 20  0 77  1  0  0  0  0 47  2
##   5  0  0  2  3  3  1  0  1 12  1 12
##   6  0  0 58  0  0  1  0  0  0  0  3
##   7  0  0  0  0  1  0  0 39  0  0  0

Interestingly, although we asked for k = 10 components, the fit retained only 7 non-empty clusters: during EM, flexmix removes components whose prior probability falls below the `minprior` control threshold.
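Since the number of components is rarely known in practice, one remedy is BIC-based model selection with `flexmix::stepFlexmix`, which fits each candidate k several times and keeps the best run. A sketch, using simulated stand-in data in place of `test_dat` (whose generator is external) and assuming flexmix is installed:

```r
library(flexmix)

# stepFlexmix() fits each candidate k with nrep random restarts and keeps
# the best run; getModel(..., "BIC") selects the k with the lowest BIC.
set.seed(1)
xy <- rbind(cbind(rnorm(50, 0), rnorm(50, 0)),   # stand-in for test_dat:
            cbind(rnorm(50, 8), rnorm(50, 0)),   # three well-separated
            cbind(rnorm(50, 0), rnorm(50, 8)))   # bivariate normal groups
fits <- stepFlexmix(xy ~ 1, k = 1:5, model = FLXMCmvnorm(diag = TRUE),
                    nrep = 3, verbose = FALSE)
best <- getModel(fits, "BIC")
best@k  # number of components preferred by BIC
```

Applying the same procedure to `test_dat` would indicate how many components the simulated data actually support.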